Progressive adaptive routing in a dragonfly processor interconnect network

ABSTRACT

A multiprocessor computer system comprises a dragonfly processor interconnect network that comprises a plurality of processor nodes and a plurality of routers. The routers are operable to adaptively route data by selecting from among a plurality of network paths from a target node to a destination node in the dragonfly network based on one or more of network congestion information from neighboring routers and failed network link information from neighboring routers.

CLAIM OF PRIORITY

This application is a continuation (and claims the benefit of priority under 35 U.S.C. §120) of U.S. application Ser. No. 13/290,507, filed on Nov. 7, 2011 and entitled PROGRESSIVE ADAPTIVE ROUTING IN A DRAGONFLY PROCESSOR INTERCONNECT NETWORK; which application claims the benefit of priority to U.S. Provisional Patent Application Ser. No. 61/410,636 filed on Nov. 5, 2010 and entitled “PROGRESSIVE ADAPTIVE ROUTING IN A DRAGONFLY PROCESSOR INTERCONNECT NETWORK,” which is hereby incorporated by reference herein in its entirety. The disclosures of the prior applications are considered part of and are incorporated by reference in their entirety in the disclosure of this application.

FIELD OF THE INVENTION

The invention relates generally to computer interconnect networks, and more specifically in one embodiment to progressive adaptive routing in a dragonfly topology processor interconnect network.

LIMITED COPYRIGHT WAIVER

A portion of the disclosure of this patent document contains material to which the claim of copyright protection is made. The copyright owner has no objection to the facsimile reproduction by any person of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office file or records, but reserves all other rights whatsoever.

BACKGROUND

Computer systems have long relied on network connections to transfer data, whether from one computer system to another computer system, one computer component to another computer component, or from one processor to another processor in the same computer. Most computer networks link multiple computerized elements to one another, and include various functions such as verification that a message sent over the network arrived at the intended recipient, confirmation of the integrity of the message, and a method of routing a message to the intended recipient on the network.

Processor interconnect networks are used in multiprocessor computer systems to transfer data from one processor to another, or from one group of processors to another group. The number of interconnection links can be very large with computer systems having hundreds or thousands of processors, and system performance can vary significantly based on the efficiency of the processor interconnect network. The number of connections, the number of intermediate nodes between a sending and receiving processing node, and the speed or type of connection all factor into the interconnect network performance.

Similarly, the network topology, or pattern of connections used to tie processing nodes together, affects performance, and remains an area of active research. It is impractical to directly link each node to each other node in systems having many tens of processors, and all but impossible as the number of processors reaches the thousands.

Further, the cost of communications interfaces, cables, and other factors can add significantly to the cost of poorly designed or inefficient processor interconnect networks, especially where long connections or high-speed fiber optic links are required. A processor interconnect network designer is thereby challenged to provide fast and efficient communication between the various processing nodes, while controlling the number of overall links, and the cost and complexity of the processor interconnect network.

The topology of a network, or the method used to determine how to link a processing node to other nodes in a multiprocessor computer system, is therefore an area of interest.

SUMMARY

The invention comprises in one example a multiprocessor computer system having a dragonfly processor interconnect network that comprises a plurality of processor nodes and a plurality of routers. The routers are operable to adaptively route data by selecting from among a plurality of network paths from a target node to a destination node in the dragonfly network based on one or more of network congestion information from neighboring routers and failed network link information from neighboring routers.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram of a dragonfly network topology, consistent with an example embodiment of the invention.

FIG. 2 is a graph illustrating scalability of a dragonfly network in nodes for various router radices, consistent with an example embodiment of the invention.

FIG. 3 is a block diagram illustrating a dragonfly network topology, consistent with an example embodiment of the invention.

FIG. 4 is a block diagram of dragonfly network topology groups, consistent with some example embodiments of the invention.

FIG. 5 is a block diagram of a dragonfly network illustrating minimal and non-minimal routing using virtual channels, consistent with an example embodiment of the invention.

FIG. 6 is a graph illustrating latency v. offered load for a variety of routing algorithms using various traffic patterns, consistent with an example embodiment of the invention.

FIG. 7 is a node group diagram of a dragonfly topology network illustrating adaptive routing via global channels using backpressure from intermediate nodes, consistent with an example embodiment of the invention.

FIGS. 8A-8B are node diagrams illustrating credit round trip latency tracking, consistent with an example embodiment of the invention.

FIG. 9 shows a router configuration, consistent with an example embodiment of the invention.

FIG. 10 shows a group of nodes in a dragonfly processor interconnect network, consistent with an example embodiment of the invention.

FIG. 11 shows connections between several node groups in a dragonfly processor interconnect network, consistent with an example embodiment of the invention.

FIG. 12 shows a router table configuration for a dragonfly processor interconnect network router, consistent with an example embodiment of the invention.

DETAILED DESCRIPTION

In the following detailed description of example embodiments of the invention, reference is made to specific examples by way of drawings and illustrations. These examples are described in sufficient detail to enable those skilled in the art to practice the invention, and serve to illustrate how the invention may be applied to various purposes or embodiments. Other embodiments of the invention exist and are within the scope of the invention, and logical, mechanical, electrical, and other changes may be made without departing from the subject or scope of the present invention. Features or limitations of various embodiments of the invention described herein, however essential to the example embodiments in which they are incorporated, do not limit the invention as a whole, and any reference to the invention, its elements, operation, and application do not limit the invention as a whole but serve only to define these example embodiments. The following detailed description does not, therefore, limit the scope of the invention, which is defined only by the appended claims.

Interconnection networks are widely used to connect processors and memories in multiprocessors, as switching fabrics for high-end routers and switches, and for connecting I/O devices. As processor and memory performance continues to increase in a multiprocessor computer system, the performance of the interconnection network plays a central role in determining the overall performance of the system. The latency and bandwidth of the network largely establish the remote memory access latency and bandwidth.

A good interconnection network is typically designed around the capabilities and constraints of available technology. Increasing router pin bandwidth, for example, has motivated the use of high-radix routers in which increased bandwidth is used to increase the number of ports per router, rather than maintaining a small number of ports and increasing the bandwidth per port. The Cray Black Widow system, one of the first systems to employ a high-radix network, uses a variant of the folded-Clos topology and radix-64 routers, a significant departure from previous low-radix 3-D torus networks. Recently, the advent of economical optical signaling has enabled topologies with long channels. However, these long optical channels remain significantly more expensive than short electrical channels. A Dragonfly topology was therefore introduced, exploiting emerging optical signaling technology by grouping routers to further increase the effective radix of the network.

The topology of an interconnection network largely determines both the performance and the cost of the network. Network cost is dominated by the cost of channels, and in particular the cost of the long, global, inter-cabinet channels. Thus, reducing the number of global channels can significantly reduce the cost of the network. To reduce global channels without reducing performance, the number of global channels traversed by the average packet must be reduced. The dragonfly topology reduces the number of global channels traversed per packet using minimal routing to one.

Dragonfly Topology Example

To achieve this global diameter of one, very high-radix routers, with a radix of approximately 2√N (where N is the size of the network), are used. While radix-64 routers have been introduced, and a radix of 128 is feasible, much higher radices in the hundreds or thousands are needed to build machines that scale to 8K to 1M nodes if each packet is limited to only one global hop using traditional very high radix router technology. To achieve the benefits of a very high radix without requiring hundreds or thousands of ports per node, the Dragonfly network topology proposes using a group of routers connected into a subnetwork as one very high radix virtual router. This very high effective radix in turn allows us to build a network in which all minimal routes traverse at most one global channel. It also increases the physical length of the global channels, exploiting the capabilities of emerging optical signaling technology.

Achieving good performance on a wide range of traffic patterns on a dragonfly topology involves selecting a routing algorithm that can effectively balance load across the global channels. Global adaptive routing (UGAL) can perform such load balancing if the load of the global channels is available at the source router, where the routing decision is made. With the Dragonfly topology, however, the source router is most often not connected to the global channel in question. Hence, the adaptive routing decision is made based on remote or indirect information.

The indirect nature of this decision leads to degradation in both latency and throughput when conventional UGAL (which uses local queue occupancy to make routing decisions) is used. We propose two modifications to the UGAL routing algorithm for the Dragonfly network topology that overcome this limitation with performance results approaching an ideal implementation using global information. Adding selective virtual-channel discrimination to UGAL (UGAL-VC_H) eliminates bandwidth degradation due to local channel sharing between minimal and non-minimal paths. Using credit round-trip latency to both sense global channel congestion and to propagate this congestion information upstream (UGAL-CR) eliminates latency degradation by providing much stiffer backpressure than is possible using only queue occupancy for congestion sensing.

High-radix networks reduce the diameter of the network but require longer cables compared to low-radix networks. Advances in signaling technology and the recent development of active optical cables facilitate implementation of high-radix topologies with longer cables.

An interconnection network is embedded in a packaging hierarchy. At the lowest level, the routers are connected via circuit boards, which are then connected via a backplane or midplane. One or more backplanes are packaged in a cabinet, with multiple cabinets connected by electrical or optical cables to form a complete system. The global (inter-cabinet) cables and their associated transceivers often dominate the cost of a network. To minimize the network cost, the topology should be matched to the characteristics of the available interconnect technologies, such as cost and performance.

The maximum bandwidth of an electrical cable drops with increasing cable length because signal attenuation due to skin effect and dielectric absorption increases linearly with distance. For typical high-performance signaling rates (10-20 Gb/s) and technology parameters, electrical signaling paths are limited to about 1 m in circuit boards and 10 m in cables. At longer distances, either the signaling rate must be reduced or repeaters inserted to overcome attenuation.

Historically, the high cost of optical signaling limited its use to very long distances or applications that demanded performance regardless of cost. Although optical cables have a higher fixed cost, their ability to transmit data over long distances at several times the data rate of copper cables results in a lower cost per unit distance than electrical cables. Based on the data available using current technologies, the break-even point is at 10 m. For distances shorter than 10 m, electrical signaling is less expensive. Beyond 10 m, optical signaling is more economical. The Dragonfly topology exploits this relationship between cost and distance. By reducing the number of global cables, it minimizes the effect of the higher fixed overhead of optical signaling, and by making the global cables longer, it maximizes the advantage of the lower per-unit cost of optical fibers.

The dollar cost of a dragonfly also compares favorably to a flattened butterfly for networks larger than 1K nodes, showing approximately a 10% savings for up to 4K nodes, and approximately a 20% cost savings relative to flattened butterfly topologies for more than 4K nodes, as the dragonfly has fewer long, global cables. Folded Clos and 3-D torus networks suffer in comparison, because of the larger number of cables needed to support high network diameters. For a network of only 1K nodes, the dragonfly is 62% the cost of a 3-D torus network and 50% that of a folded Clos network. This reduction in network cost is directly correlated to a reduction in network power consumed, which is a significant advantage for large networks as well as for installations that are desirably environmentally friendly.

The example embodiments of a dragonfly network presented here show how use of a group of routers as a virtual router can increase the effective radix of a network, and hence reduce network diameter, cost, and latency. Because the dragonfly topology reduces the number of global cables in a network, while at the same time increasing their length, the dragonfly topology is particularly well suited for implementations using emerging active optical cables, which have a high fixed cost but a low cost per unit length compared to electrical cables. Using active optical cables for the global channels, a dragonfly network reduces cost by 20% compared to a flattened butterfly and by 52% compared to a folded Clos network of the same bandwidth.

To show an example Dragonfly network topology, the following symbols are used in the description of the dragonfly topology and in example routing algorithms presented later:

-   N: Number of network terminals
-   p: Number of terminals connected to each router
-   a: Number of routers in each group
-   k: Radix of the routers
-   k′: Effective radix of the group (or the virtual router)
-   h: Number of channels within each router used to connect to other groups
-   g: Number of groups in the system
-   q: Queue depth of an output port
-   q_vc: Queue depth of an individual output VC
-   H: Hop count
-   Out_i: Router output port i

The Dragonfly topology is a hierarchical network with three levels, as shown in FIG. 1: routers (104, 105, and 106), groups (101, 102, and 103), and system. At the router level, each router has connections to p nodes, a−1 local channels (to other routers in the same group), and h global channels (to routers in other groups). Therefore the radix (or degree) of each router is defined as k=p+a+h−1. A group consists of a routers connected via an intra-group interconnection network formed from local channels, as shown at 101 in FIG. 1. Each group has ap connections to terminals and ah connections to global channels, and all of the routers in a group collectively act as a virtual router with radix k′=a(p+h). This very high radix, k′>>k, enables the system level network to be realized with very low global diameter (the maximum number of expensive global channels on the minimum path between any two nodes). Up to g=ah+1 groups (N=ap(ah+1) terminals) can be connected with a global diameter of one. In contrast, a system-level network built directly with radix k routers would require a larger global diameter.
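The relations above can be checked numerically. The following is a minimal sketch, not part of the described embodiments; the class name `Dragonfly` and its property names are hypothetical, and it assumes each router uses a−1 local ports, which is what k=p+a+h−1 implies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dragonfly:
    """Hypothetical helper that evaluates the topology relations in the text."""
    p: int  # terminals connected to each router
    a: int  # routers in each group
    h: int  # global channels per router

    @property
    def router_radix(self) -> int:
        # k = p + a + h - 1: p terminal ports, (a - 1) local ports, h global ports
        return self.p + (self.a - 1) + self.h

    @property
    def effective_radix(self) -> int:
        # k' = a(p + h): the whole group acting as one virtual router
        return self.a * (self.p + self.h)

    @property
    def max_groups(self) -> int:
        # g = ah + 1 groups can be connected with a global diameter of one
        return self.a * self.h + 1

    @property
    def max_terminals(self) -> int:
        # N = ap(ah + 1)
        return self.a * self.p * self.max_groups


example = Dragonfly(p=2, a=4, h=2)
print(example.router_radix, example.effective_radix,
      example.max_groups, example.max_terminals)
```

With p=h=2 and a=4, the sketch gives k=7, k′=16, g=9, and N=72.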

In a maximum-size (N=ap(ah+1)) dragonfly, there is exactly one connection between each pair of groups. In smaller dragonflies, there are more global connections out of each group than there are other groups. These extra global connections are distributed over the groups with each pair of groups connected by at least ⌊(ah+1)/g⌋ channels.

The dragonfly parameters a, p, and h can have any values. However, to balance channel load, the network in this example has a=2p=2h. Because each packet traverses two local channels along its route (one at each end of the global channel) for one global channel and one terminal channel, this ratio maintains balance. Because global channels are expensive, deviations from this 2:1 ratio are done in some embodiments in a manner that overprovisions local and terminal channels, so that the expensive global channels remain fully utilized. That is, the network is balanced in such examples so that a ≥ 2h and 2p ≥ 2h.

The scalability of a balanced dragonfly is shown in FIG. 2. By increasing the effective radix, the dragonfly topology is highly scalable: with radix-64 routers, the topology scales to over 256K nodes with a network diameter of only three hops. Arbitrary networks can be used for the intra-group and inter-group networks in FIG. 1. In the example presented here, we use a 1-D flattened butterfly or a completely-connected topology for both networks. A simple example of the dragonfly is shown in FIG. 3 with p=h=2 (two processing nodes per router and two channels within each router coupled to other groups), a=4 (four routers in each group) that scales to N=72 (72 nodes in the network) with k=7 (radix 7) routers. By using virtual routers, the effective radix is increased from k=7 to k′=16, as group G₀ of FIG. 3 has eight global connections and eight node connections.
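As an illustration of the balanced case, the sketch below picks the largest balanced configuration (a=2p=2h) that fits in a router of radix k, using k ≥ p+a+h−1 = 4h−1. The function name `balanced_dragonfly` and the returned dictionary keys are hypothetical and only serve this example.

```python
def balanced_dragonfly(k: int) -> dict:
    """Largest balanced dragonfly (a = 2p = 2h) buildable from radix-k routers.

    With p = h and a = 2h, a router needs p + a + h - 1 = 4h - 1 ports,
    so the largest usable h is floor((k + 1) / 4).
    """
    h = (k + 1) // 4
    if h < 1:
        raise ValueError("router radix too small for a balanced dragonfly")
    p, a = h, 2 * h
    groups = a * h + 1                 # g = ah + 1
    return {
        "p": p,
        "a": a,
        "h": h,
        "ports_used": p + a + h - 1,   # may be less than k
        "groups": groups,
        "terminals": a * p * groups,   # N = ap(ah + 1)
    }


print(balanced_dragonfly(64)["terminals"])
print(balanced_dragonfly(7))
```

For k=64 this yields h=16, a=32, p=16 and N=262,656 terminals, consistent with the statement that radix-64 routers scale to over 256K nodes; for k=7 it reproduces the N=72 example.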

The global radix, k′, can be increased further by using a higher-dimensional topology for the intra-group network. Such a network may also exploit intra-group packaging locality. For example, a 2-D flattened butterfly is shown in FIG. 4 at 401, which has the same k′ as the group shown in FIG. 5 but exploits packaging locality by providing more bandwidth to local routers. A 3-D flattened butterfly is used in FIG. 4 at 402 to increase the effective radix from k′=16 to k′=32, allowing the topology to scale up to N=1056 using the same k=7 router as in FIG. 1.

To increase the terminal bandwidth of a high-radix network such as a dragonfly, channel slicing can be employed. Rather than make the channels wider, which would decrease the router radix, multiple networks can be connected in parallel to add capacity. Similarly, the dragonfly topology in some embodiments can also utilize parallel networks to add capacity to the network. In addition, the dragonfly networks described so far assumed uniform bandwidth to all nodes in the network. However, if such uniform bandwidth is not needed, bandwidth tapering can be implemented by removing inter-group channels among some of the groups.

Dragonfly Routing Examples

A variety of minimal and non-minimal routing algorithms can be implemented using the dragonfly topology. Some embodiments of global adaptive routing using local information lead to limited throughput and very high latency at intermediate loads. To overcome these problems, we introduce new mechanisms to global adaptive routing, which provide performance that approaches an ideal implementation of global adaptive routing.

Minimal routing in a dragonfly from source node s attached to router Rs in group Gs to destination node d attached to router Rd in group Gd traverses a single global channel and is accomplished in three steps:

-   Step 1: If Gs ≠ Gd and Rs does not have a connection to Gd, route within Gs from Rs to Ra, a router that has a global channel to Gd.
-   Step 2: If Gs ≠ Gd, traverse the global channel from Ra to reach router Rb in Gd.
-   Step 3: If Rb ≠ Rd, route within Gd from Rb to Rd.
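The three steps can be sketched as follows. This is an illustrative model only: routers are identified by hypothetical (group, router) pairs, the `global_links` table mapping a (source group, destination group) pair to the routers at each end of the connecting global channel is an assumption of this example, and each group is assumed to be completely connected so that any intra-group move is a single local hop.

```python
def minimal_route(src, dst, global_links):
    """Return the list of routers visited on a minimal route.

    src, dst: (group, router) pairs for Rs and Rd.
    global_links[(g_from, g_to)] = (Ra, Rb): Ra in g_from owns the global
    channel to g_to, and Rb is the router in g_to at the other end.
    """
    (gs, rs), (gd, _rd) = src, dst
    path = [src]
    if gs != gd:
        ra, rb = global_links[(gs, gd)]
        if rs != ra:
            path.append((gs, ra))   # Step 1: local hop to Ra within Gs
        path.append((gd, rb))       # Step 2: traverse the global channel to Rb
    if path[-1] != dst:
        path.append(dst)            # Step 3: local hop from Rb to Rd within Gd
    return path


# Hypothetical two-group example: router 2 of group 0 links to router 0 of group 1.
links = {(0, 1): (2, 0), (1, 0): (0, 2)}
print(minimal_route((0, 0), (1, 3), links))
```

In this example the route visits Rs, Ra, Rb, and Rd in turn, crossing exactly one global channel.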

This minimal routing works well for load-balanced traffic, but results in poor performance on adversarial traffic patterns. To load-balance adversarial traffic patterns, Valiant's algorithm can be applied at the system level, routing each packet first to a randomly-selected intermediate group Gi and then to its final destination d. Applying Valiant's algorithm to groups suffices to balance load on both the global and local channels. This randomized non-minimal routing traverses at most two global channels and requires five steps:

-   Step 1: If Gs ≠ Gi and Rs does not have a connection to Gi, route within Gs from Rs to Ra, a router that has a global channel to Gi.
-   Step 2: If Gs ≠ Gi, traverse the global channel from Ra to reach router Rx in Gi.
-   Step 3: If Gi ≠ Gd and Rx does not have a connection to Gd, route within Gi from Rx to Ry, a router that has a global channel to Gd.
-   Step 4: If Gi ≠ Gd, traverse the global channel from Ry to router Rb in Gd.
-   Step 5: If Rb ≠ Rd, route within Gd from Rb to Rd.
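The five steps can be sketched in the same style. As before, the (group, router) identifiers, the `global_links` table, and the assumption of completely connected groups are hypothetical modelling choices; the optional `intermediate_group` argument exists only so the example can be made deterministic.

```python
import random


def _move_to_group(path, target_group, global_links):
    """Append the hops needed to reach target_group from the end of path."""
    group, router = path[-1]
    if group == target_group:
        return
    ra, rb = global_links[(group, target_group)]
    if router != ra:
        path.append((group, ra))       # local hop to the router owning the link
    path.append((target_group, rb))    # global hop


def valiant_route(src, dst, global_links, num_groups,
                  intermediate_group=None, rng=random):
    """Route via a randomly selected intermediate group Gi, then to dst."""
    gi = rng.randrange(num_groups) if intermediate_group is None else intermediate_group
    path = [src]
    _move_to_group(path, gi, global_links)       # Steps 1-2: Gs -> Gi
    _move_to_group(path, dst[0], global_links)   # Steps 3-4: Gi -> Gd
    if path[-1] != dst:
        path.append(dst)                         # Step 5: Rb -> Rd
    return path


# Hypothetical 3-group network: in group g, router t owns the link to group t,
# and that link arrives at router g of group t.
links = {(g, t): (t, g) for g in range(3) for t in range(3) if g != t}
print(valiant_route((0, 0), (1, 3), links, num_groups=3, intermediate_group=2))
```

If the randomly chosen Gi happens to equal Gs or Gd, the corresponding steps are skipped and the route degenerates to the minimal one.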

To prevent routing deadlock, two virtual channels (VCs) are employed for minimal routing and three VCs are required for non-minimal routing, as shown in FIG. 5. These virtual channel assignments eliminate all channel dependencies due to routing. For some applications, additional virtual channels may be required to avoid protocol deadlock; e.g., for shared memory systems, separate sets of virtual channels may be required for request and reply messages.

A variety of routing algorithms for the dragonfly topology have been evaluated, including:

-   Minimal (MIN): The minimal path is taken as described previously.
-   Valiant (VAL) [32]: Randomized non-minimal routing as described previously.
-   Universal Globally-Adaptive Load-balanced [29] (UGAL-G, UGAL-L): UGAL chooses between MIN and VAL on a packet-by-packet basis to load-balance the network. The choice is made by using queue length and hop count to estimate network delay and choosing the path with minimum delay. We implement two versions of UGAL.
-   UGAL-L: uses local queue information at the current router node.
-   UGAL-G: uses queue information for all the global channels in Gs, assuming knowledge of queue lengths on other routers. While difficult to implement, this represents an ideal implementation of UGAL since the load-balancing is required of the global channels, not the local channels.

The different routing algorithms are evaluated using both benign and adversarial synthetic traffic patterns, as shown in FIG. 6. Latency v. offered load is shown for the four routing algorithms, using both uniform random traffic at 601 and adversarial traffic at 602. The use of a synthetic traffic pattern allows us to stress the topology and routing algorithm to fully evaluate the network. For benign traffic such as uniform random (UR), MIN is sufficient to provide low latency and high throughput, as shown at 601 of FIG. 6. VAL achieves approximately half of the network capacity because its load-balancing doubles the load on the global channels. Both UGAL-G and UGAL-L approach the throughput of MIN, but with slightly higher latency near saturation. The higher latency is caused by the use of parallel or greedy allocation where the routing decision at each port is made in parallel. The use of sequential allocation will reduce the latency at the expense of a more complex allocator.

Adaptive routing on the dragonfly is challenging because it is the global channels, the group outputs, that need to be balanced, not the router outputs. This leads to an indirect routing problem. Each router picks a global channel to use using only local information that depends only indirectly on the state of the global channels. Previous global adaptive routing methods used local queue information, source queues and output queues, to generate accurate estimates of network congestion. In these cases, the local queues were an accurate proxy of global congestion, because they directly indicated congestion on the routes they initiated. With the dragonfly topology, however, local queues only sense congestion on a global channel via backpressure over the local channels. If the local channels are overprovisioned, significant numbers of packets must be enqueued on the overloaded minimal route before the source router will sense the congestion. This results in a degradation in throughput and latency as shown earlier in FIG. 6 at 602.

A throughput issue with UGAL-L arises due to a single local channel handling both minimal and non-minimal traffic. For example, in FIG. 7, a packet in R1 has a minimal path which uses gc7 and a nonminimal path which uses gc6. Both paths share the same local channel from R1 to R2. Because both paths share the same local queue (and hence have the same queue occupancy) and the minimal path is shorter (one global hop vs. two), the minimal channel will always be selected, even when it is saturated. This leads to the minimal global channel being overloaded and the non-minimal global channels that share the same router as the minimal channel being underutilized. With UGAL-G, the minimal channel is preferred and the load is uniformly balanced across all other global channels. With UGAL-L, on the other hand, the non-minimal channels on the router that contains the minimal global channel are underutilized, resulting in a degradation of network throughput.

To overcome this limitation, we modify the UGAL algorithm to separate the queue occupancy into minimal and nonminimal components by using individual VCs (UGAL-LVC).

    if (q_m_vc × H_m ≤ q_nm_vc × H_nm)
        route minimally;
    else
        route nonminimally;

where the subscripts m and nm denote the minimal and nonminimal paths. If the VC assignment of FIG. 5 is used, q_m_vc = q(VC1) and q_nm_vc = q(VC0).
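The comparison above can be written as a small predicate. This is a sketch only; the function and parameter names are hypothetical, and the queue depths are assumed to be supplied by the router as integers.

```python
def ugal_lvc_route_minimally(q_m_vc: int, h_m: int, q_nm_vc: int, h_nm: int) -> bool:
    """UGAL-LVC decision: compare per-VC queue depth weighted by hop count.

    q_m_vc / q_nm_vc: occupancy of the VC used by the minimal / nonminimal path.
    h_m / h_nm: hop counts of the minimal / nonminimal path.
    Returns True to route minimally, False to route nonminimally.
    """
    return q_m_vc * h_m <= q_nm_vc * h_nm


# Lightly loaded minimal VC: 1*3 <= 1*5, so the packet is routed minimally.
print(ugal_lvc_route_minimally(q_m_vc=1, h_m=3, q_nm_vc=1, h_nm=5))
# Congested minimal VC: 4*3 > 1*5, so the packet is routed nonminimally.
print(ugal_lvc_route_minimally(q_m_vc=4, h_m=3, q_nm_vc=1, h_nm=5))
```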

When compared, UGAL-LVC matches the throughput of UGAL-G on a WC traffic pattern, but for UR traffic the throughput is limited, with approximately a 30% reduction in throughput. For the WC traffic, where most of the traffic needs to be sent non-minimally, UGAL-LVC performs well since the minimal queue is heavily loaded. However, for load-balanced traffic, when most traffic should be sent minimally, individual VCs do not provide an accurate representation of the channel congestion, resulting in throughput degradation.

To overcome this limitation, we further modify the UGAL algorithm to separate the queue occupancy into minimal and non-minimal components only when the minimal and nonminimal paths start with the same output port. Our hybrid modified UGAL routing algorithm (UGAL-LVC_H) is:

    if ((q_m × H_m ≤ q_nm × H_nm && Out_m ≠ Out_nm) ||
        (q_m_vc × H_m ≤ q_nm_vc × H_nm && Out_m = Out_nm))
        route minimally;
    else
        route nonminimally;
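The hybrid rule can be sketched as a predicate as well. The names below are hypothetical; the branch on the output ports is logically equivalent to the two-clause condition above, since exactly one of Out_m ≠ Out_nm and Out_m = Out_nm holds.

```python
def ugal_lvc_h_route_minimally(q_m: int, q_nm: int,
                               q_m_vc: int, q_nm_vc: int,
                               h_m: int, h_nm: int,
                               out_m: int, out_nm: int) -> bool:
    """Hybrid UGAL decision (a sketch of the rule stated in the text).

    When the minimal and nonminimal paths leave through different output
    ports, whole-port queue depths (q_m, q_nm) are compared. When they share
    an output port, the per-VC depths (q_m_vc, q_nm_vc) are compared instead.
    Returns True to route minimally, False to route nonminimally.
    """
    if out_m != out_nm:
        return q_m * h_m <= q_nm * h_nm
    return q_m_vc * h_m <= q_nm_vc * h_nm


# Shared output port: the port-level depths are ignored, the VC depths decide.
print(ugal_lvc_h_route_minimally(q_m=8, q_nm=8, q_m_vc=6, q_nm_vc=2,
                                 h_m=3, h_nm=5, out_m=1, out_nm=1))
```

In the usage line, 6×3 exceeds 2×5, so the packet is routed nonminimally even though the shared port reports the same total occupancy for both paths.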

Compared to UGAL-LVC, UGAL-LVC_H provides the same throughput on the WC traffic pattern and matches the throughput of UGAL-G on UR traffic, but with nearly 2× higher latency at an offered load of 0.8, near saturation. For WC traffic, UGAL-LVC_H also results in higher intermediate latency compared to UGAL-G.

The high intermediate latency of UGAL-L is due to minimally-routed packets having to fill the channel buffers between the source and the point of congestion before congestion is sensed. Our research shows that non-minimally routed packets have a latency curve comparable to UGAL-G while minimally-routed packets see significantly higher latency. As input buffers are increased, the latency of minimally-routed packets increases and is proportional to the depth of the buffers. A histogram of latency distribution shows two clear distributions: one large distribution with low latency for the non-minimal packets and another distribution with a limited number of packets but with much higher latency for the minimal packets.

To understand this problem with UGAL-L, in the example dragonfly group shown in FIG. 7, assume a packet in R1 is making its global adaptive routing decision of routing either minimally through gc0 or non-minimally through gc7. The routing decision needs to load balance global channel utilization and ideally, the channel utilization can be obtained from the queues associated with the global channels, q0 and q3. However, the q0 and q3 queue information is only available at R0 and R2 and not readily available at R1; thus, the routing decision can only be made indirectly through the local queue information available at R1.

In this example, q1 reflects the state of q0 and q2 reflects the state of q3. When either q0 or q3 is full, the flow control provides backpressure to q1 and q2 as shown with the arrows in FIG. 7. As a result, in steady-state measurement, this local queue information can be used to accurately measure the throughput. Since the throughput is defined as the offered load when the latency goes to infinity (or the queue occupancy goes to infinity), this local queue information is sufficient. However, q0 needs to be completely full in order for q1 to reflect the congestion of gc0 and allow R1 to route packets non-minimally. Thus, using local information requires sacrificing some packets to properly determine the congestion, resulting in packets being sent minimally having much higher latency. As the load increases, although minimally routed packets continue to increase in latency, more packets are sent non-minimally, which results in a decrease in average latency until saturation.

In order for local queues to provide a good estimate of global congestion, the global queues need to be completely full and provide a stiff backpressure towards the local queues. The stiffness of the backpressure is inversely proportional to the depth of the buffer: with deeper buffers, it takes longer for the backpressure to propagate, while with shallower buffers, a much stiffer backpressure is provided. As the buffer size decreases, the latency at intermediate load is decreased because of the stiffer backpressure. However, using smaller buffers comes at the cost of reduced network throughput.

To overcome the high intermediate latency, we propose using credit round-trip latency to sense congestion faster and reduce latency. In credit-based flow control, illustrated in FIGS. 8A-8B, credit counts are maintained for buffers downstream. As packets are sent downstream, the appropriate credit count is decremented, and once the packet leaves the downstream router, credits are sent back upstream and the credit count is incremented. The latency for the credits to return is referred to as credit round-trip latency (tcrt). If there is congestion downstream, the packet cannot be immediately processed, which results in an increase in tcrt.

Referring to FIG. 8A, conventional credit flow control is illustrated at 801. As packets are sent downstream (1), the output credit count is decremented (2) and credits are sent back upstream (3). This scheme is modified as shown in FIG. 8B at 802 to use credit round trip latency to estimate congestion in the network. In addition to the output credit count being decremented (2), the time stamp is pushed into the credit time queue, denoted CTQ. Before sending the credit back upstream (4), the credit is delayed (3), and when downstream credits are received (5), the credit count is updated as well as the credit round trip latency tcrt.
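The bookkeeping for one output can be sketched as below. This is an illustrative model rather than the router hardware: the class name `OutputCreditTracker` and its methods are hypothetical, time is an abstract integer supplied by the caller, and credits are assumed to return in the order the flits were sent so that the CTQ can be a simple FIFO.

```python
from collections import deque


class OutputCreditTracker:
    """Tracks the credit count and credit round-trip latency for one output."""

    def __init__(self, buffer_depth: int):
        self.credits = buffer_depth   # free downstream buffer slots
        self.ctq = deque()            # credit time queue (CTQ) of send time stamps
        self.t_crt = None             # most recent credit round-trip latency

    def send_flit(self, now: int) -> None:
        if self.credits == 0:
            raise RuntimeError("no downstream buffer space available")
        self.credits -= 1             # decrement the output credit count
        self.ctq.append(now)          # push the time stamp into the CTQ

    def receive_credit(self, now: int) -> int:
        sent_at = self.ctq.popleft()  # oldest outstanding flit (FIFO assumption)
        self.credits += 1             # update the credit count
        self.t_crt = now - sent_at    # update the credit round-trip latency
        return self.t_crt


port = OutputCreditTracker(buffer_depth=2)
port.send_flit(now=0)
port.send_flit(now=1)
print(port.receive_credit(now=10), port.receive_credit(now=15))
```

In the usage lines, the second credit returns later relative to its send time (14 versus 10 time units), which is the kind of increase in tcrt that indicates downstream congestion.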

The value of tcrt can be used to estimate the congestion of global channels. By using this information to delay upstream credits, we stiffen the backpressure and more rapidly propagate congestion information upstream. For each output O, tcrt(O) is measured and the quantity td(O)=tcrt(O)−tcrt0 is stored in a register. Then, when a flit is sent to output O, instead of immediately sending a credit back upstream, the credit is delayed by td(O)−min [td(o)]. The credits sent across the global channels are not delayed. This ensures that there is no cyclic loop in this mechanism and allows the global channels to be fully utilized.

The delay of returning credits provides the appearance of shallower buffers to create a stiff backpressure. However, to ensure that the entire buffer gets utilized and there is no reduced throughput at high load, the credits need to be delayed by the variance of td across all outputs. We estimate the variance by finding the min [td(o)] value and using the difference. By delaying credits, the upstream routers observe congestion at a faster rate (compared to waiting for the queues to fill up), which leads to better global adaptive routing decisions.

The UGAL-L routing algorithm using credit latency (UGAL-LCR) is evaluated for both WC and UR traffic using buffers of depth 16 and 256. UGAL-LCR leads to a significant reduction in latency compared to UGAL-L and approaches the latency of UGAL-G. For WC traffic, UGAL-LCR reduces latency by up to 35% with buffers of depth 16 and reduces intermediate latency by more than 20× with buffers of depth 256, compared to UGAL-L. Unlike UGAL-L, the intermediate latency with UGAL-LCR is independent of buffer size. For UR traffic, UGAL-LCR provides up to 50% latency reduction near saturation compared to UGAL-LVC H. However, both UGAL-LCR and UGAL-LVC H fall short of the throughput of UGAL-G with UR traffic because their imprecise local information results in some packets being routed non-minimally.

The implementation of this scheme results in minimal complexity overhead, as only the following three features are needed at each router:

-   tracking credits individually to measure tcrt
-   registers to store td values
-   a delay mechanism in returning credits

The amount of storage required for td is minimal, as only O(k) registers are required. The credits are often returned by piggybacking on data flits, so delaying a credit to wait for the transmission of the next data flit upstream is already required. The proposed mechanism only requires adding additional delay.

As for tracking individual credits, credits are conventionally tracked as a pool of credits in credit flow control, i.e., a single credit counter is maintained for each output VC and increments when a credit is received. The implementation of UGAL-LCR requires tracking each credit individually. This can be done by pushing a timestamp on the tail of a queue each time a flit is sent, as shown in FIG. 8B with the use of a credit timestamp queue (CTQ), and popping the timestamp off the head of the queue when the corresponding credit arrives. Because flits and credits are 1:1 and maintain ordering, the simple queue suffices to measure round-trip credit latency. The depth of the queue needs to be proportional to the depth of the data buffers, but the queue size can be reduced to utilize imprecise information to measure congestion, e.g., by having a queue which is only ¼ of the data buffer size, only one of four credits is tracked to measure the congestion.
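The timestamp queue described above can be modeled in a few lines. The following is a minimal software sketch of the idea, not the hardware design. The class and method names are hypothetical, time is passed in as an integer cycle count, and the full-depth queue (every credit tracked) is assumed.

```python
from collections import deque


class CreditTimestampQueue:
    """Illustrative model of one output's credit counter plus CTQ."""

    def __init__(self, credits):
        self.credits = credits   # output credit count
        self.stamps = deque()    # send times of flits awaiting a credit
        self.t_crt = None        # most recent measured round-trip latency

    def send_flit(self, now):
        if self.credits == 0:
            raise RuntimeError("no credits available for this output")
        self.credits -= 1
        self.stamps.append(now)  # push the timestamp on the tail

    def receive_credit(self, now):
        # Flits and credits are 1:1 and ordered, so the head of the
        # queue always belongs to the credit that just arrived.
        sent = self.stamps.popleft()
        self.credits += 1
        self.t_crt = now - sent
        return self.t_crt


ctq = CreditTimestampQueue(credits=2)
ctq.send_flit(now=0)
ctq.send_flit(now=3)
first_latency = ctq.receive_credit(now=10)
```

A reduced-depth variant that samples only one of every four credits would follow the same pattern, recording a timestamp for every fourth flit.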

The cost of a dragonfly topology also compares favorably to a flattened butterfly, as well as to other topologies. The flattened butterfly topology reduces network cost of a butterfly by removing intermediate routers and channels. As a result, the flattened butterfly reduces cost by approximately 50% compared to a folded-Clos on balanced traffic. The dragonfly topology extends the flattened butterfly by increasing the effective radix of the routers to further reduce the cost and increase the scalability of the network.

A comparison of dragonfly and flattened butterfly networks of 64K nodes shows that a flattened butterfly uses 50% of the router ports for global channels, while a dragonfly uses 25% of the ports for global connections. The flattened butterfly requires two additional dimensions, while the dragonfly is a single dimension. In addition, the dragonfly provides better scalability because the group size can be increased to scale the network, whereas scaling the flattened butterfly requires adding additional dimensions. With the hop count nearly identical, the dragonfly trades off longer global cables for a smaller number of global cables, providing a more cost-efficient topology better matched to emerging signaling technologies.

Various embodiments of dragonfly networks described here also comprise two new variants of global adaptive routing that overcome the challenge of indirect adaptive routing presented by the dragonfly. A dragonfly router will typically make a routing decision based on the state of a global channel attached to a different router in the same group. Conventional global adaptive routing algorithms that use local queue occupancies to infer the state of this remote channel give degraded throughput and latency. We introduce the selective use of virtual channel discrimination to overcome the bandwidth degradation. We also introduce the use of credit round-trip latency to both sense and signal channel congestion. The combination of these two techniques gives a global adaptive routing algorithm that attempts to approach the performance of an ideal algorithm with perfect knowledge of remote channel state.

Progressive Adaptive Routing in a Dragonfly Network

An improved routing method for Dragonfly processor interconnect networks is proposed here, providing deadlock-safe adaptive routing that is operable to choose among multiple legal routes based on congestion or down links. This adaptive routing method provides better routing performance and tolerance for downed or busy links than prior methods, and explicitly communicates congestion across channels as opposed to withholding credits, which may negatively impact bandwidth.

In some embodiments, a network route is selected from among multiple minimal routes, such as routing in different dimensions first, and optionally further selected from one or more non-minimal routes, such as using randomly chosen hops to avoid congestion or downed links.

Routing choices are presented via tables in one example, and may be biased toward certain routes or toward minimal or non-minimal routes depending on the network configuration and state. For example, route choice may be biased toward minimal routing by default for highest efficiency, with a bias switch toward non-minimal routing to protect a certain network link from arbitrarily or unnecessarily receiving additional traffic.

Congestion information is utilized in some embodiments by deriving an anticipated next link congestion from elements such as counting the number of messages in an output queue and establishing a receiving buffer congestion estimate based on factors such as credits or messages in-flight. A node can query a potential receiving node for the average “next link” output congestion, enabling the node to make a routing decision based on avoiding congested or down links.

FIG. 9 shows a Dragonfly network router, consistent with an example embodiment of the invention. The router block shown here comprises 48 tiles, with each tile corresponding to an input/output pair. The tiles are organized in an 8×6 matrix, such that incoming packet data at a particular tile is routed across the row to one of the 8 columns, then up or down that column to one of the 6 rows, arriving at the appropriate tile for output. The channels in further embodiments feature multiple virtual channels, virtual channel switching in-flight, error correction such as SECDED, and input buffering including dynamic allocation to virtual channels as needed to improve network performance.

Referring again to the example of FIG. 9, forty of the tiles connect to external network links, while eight of the tiles connect to processor cores local to the processor node. Each tile comprises an input queue, a subswitch, and a column buffer. The input queue receives packets from a serializer/deserializer interface to the network, and determines how to route the packet. The packet is sent across the row bus to the subswitch in the appropriate column. The subswitch receives the packets, switches them to the appropriate virtual channel, and sends the packet out one of the six column buses to the column buffer in the appropriate row. The column buffer collects the packet data from the six tiles within the column and sends the packet data across the network channel.

The dragonfly network topology in this example is a hierarchical network of two layers of a flattened butterfly topology. The first layer is a two-dimensional flattened butterfly that connects all of the router chips within a local group, such as a computer cabinet or chassis. Each group is treated as a very high-radix router, and a single dimension flattened butterfly (all-to-all) connects all of the groups to form the second layer of the dragonfly topology example presented here.

The first dimension within the group, referred to for convenience as the “green” dimension, connects the 16 routers within a chassis. The second dimension within a group is similarly called the “black” dimension, and connects the six chassis within a two-cabinet group. This is reflected in the network configuration shown in the network “group” of FIG. 10, which illustrates six chassis (represented as the six rows), made up of 16 routers per chassis (represented as the 16 columns).

Groups such as are illustrated in FIG. 10 are further coupled to one another using links in the “blue” dimension, as shown in FIG. 11. These “blue” links between groups connect each group to each other group, to a maximum of 240 blue links per group in this example, or 241 groups per system. Each link can comprise multiple ports, such as four ports per link or optical cable, resulting in four ports connecting each pair of groups over a single cable. In systems having fewer groups, unused ports from the 240 blue ports per group can be used to provide additional bandwidth between configured groups, such as two links per group pair in a network having 120 groups, providing eight ports connecting each pair of groups.

In the network, packets route from a source node to a target node, traversing at least one but possibly all three dimensions shown in FIGS. 9-11. A routing path traversing all three dimensions will likely first be routed in the green dimension and then the black dimension to reach the appropriate node in a group to link to the target group, then the blue dimension to reach the intended target group. The packet is then routed in the green and black dimensions within the group to reach the intended target node in the target group, resulting in five routings within three dimensions to reach the target.

The network supports both adaptive and deterministic routing in one embodiment.

Deterministic routing sends a given packet over a predetermined route over the network irrespective of network congestion. When multiple deterministic paths are available, deterministic traffic can be hashed based on a destination node, address, or other such characteristics to distribute traffic between the multiple paths. Packets traveling between the same source and target will in some embodiments arrive at the target in order, as all packets between the source and target take the same deterministic path.

Adaptive routing permits packets to take different routes based on congestion levels within the network. In some embodiments, packets may arrive out of order when using adaptive routing, and may take non-minimal paths when congestion dictates avoiding a minimal path.

Minimal routing in a dragonfly occurs when a packet traverses at most one link in a given dimension. Minimal routing within a group, such as shown in FIG. 10, will therefore take at the most one hop in the “green” dimension and one hop in the “black” dimension. A minimal path between nodes in different groups will take at most one hop in the green dimension and one hop in the black dimension in each group, and will take one additional hop to travel from the source group to the target group.

As either the black or green dimensions may be traversed first, there are multiple minimal paths, both in the source and destination groups. If multiple links between groups exist, one path may not require a hop in the black or green dimension in either the source or destination groups, reducing the total number of hops needed to complete a minimal path to less than five.

Non-minimal routing can take multiple hops in either the black or green dimension in the source or target groups, resulting in more than five hops. Additional hops may be desirable in circumstances where congestion is present in the minimal path or paths available to the router, improving the speed of message delivery to the target while avoiding further congesting an already congested network link. Further embodiments attempt to spread traffic over available links, such as by randomizing or hashing path selection to avoid creating additional congested network regions as a result of repeatedly routing the same path around a previously congested link.

In one such embodiment, an intermediate node is chosen in the group such as that of FIG. 10, such that the message is first minimally routed to the intermediate node, and then routed from the intermediate node to the final node in the group. This results in up to two hops in each of the green and black dimensions, or double the number of hops in minimal routing within a group. Routing may be nonminimal within the source group, nonminimal within the target group, or nonminimal in both the source and target groups.

Nonminimal routing can also occur between groups, such as where a message is routed minimally within the source and target groups but is routed through an intermediate group between the source and target groups to avoid congestion in the link between the source and target groups. Routing within the source, intermediate, and target groups may further be minimal or nonminimal, depending on congestion within each of the groups.

The type of routing used for a given packet or message is determined in one embodiment by a routing control field in the packet header. For example, the routing control field may indicate that deterministic non-minimal hashed routing is to be used when preserving packet order is desired. Packets are distributed across available paths using the target node as a hash. Traffic is routed nonminimally, but distributing the packets among various intermediate nodes in the group results in reduced hot spots or congestion.

Deterministic minimal hashed routing provides hashing of packets over minimal paths, which reduces the number of hops in a given group by permitting routing over alternate minimal paths, such as black dimension before green dimension or green dimension before black dimension. This can result in severe network congestion in certain situations, and so may not be desirable unless global traffic is particularly uniformly distributed.

Deterministic minimal non-hashed routing uses a single deterministic minimal path for all traffic, which provides packet ordering but does not provide good bandwidth or load distribution among available paths. Such routing may be used for infrequent or small messages, such as control messages or latency-critical messages.

Adaptive routing can be used as a default routing type when ordering is not required. Packets will attempt to route minimally, but may take non-minimal paths in groups or between groups to avoid network congestion. Adaptive routing is provided in some embodiments using routing tables that provide two or more minimal and two or more non-minimal ports for consideration in making a routing choice. A congestion value is computed for each node or tile in a router and distributed to other tiles in the router, such as the router tiles shown in FIG. 9. The adaptive routing algorithm considers in this example the two minimal and two nonminimal paths available, and selects from them based on the congestion values and optionally on various configured biases.

Port congestion values are derived in a further embodiment from factors such as downstream port congestion, estimated far-end link congestion, and near-end link congestion. In a specific example, two bits of downstream port congestion information are propagated across the external channel corresponding to each tile in a router chip, and updated periodically. These bits will be generated at the transmitting router chip by combining a view of congestion of downstream ports on the chip. The downstream ports that are combined into this 2-bit congestion value are selected via an MMR-configurable mask at each tile. The congestion values of these downstream ports are summed and compared to three programmable thresholds. If the sum is greater than the highest threshold, the congestion is 2′b11. If the sum is less than the highest threshold, but greater than the middle threshold, the congestion is 2′b10. If the sum is less than the middle threshold and greater than the lowest threshold, the congestion is 2′b01. Otherwise, if the sum is less than the lowest threshold, the congestion is 2′b00.
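The threshold comparison above can be sketched as follows. The function name and argument layout are hypothetical. The text does not say which level applies when the sum exactly equals a threshold, so this sketch assumes that equality falls into the lower level.

```python
def downstream_congestion_2bit(port_values, mask, thresholds):
    """Combine masked downstream port congestion into a 2-bit level.

    port_values : congestion value for each downstream port
    mask        : per-port selection flags (the MMR-configurable mask)
    thresholds  : (lowest, middle, highest) programmable thresholds
    """
    lowest, middle, highest = thresholds
    total = sum(v for v, selected in zip(port_values, mask) if selected)
    if total > highest:
        return 0b11
    if total > middle:
        return 0b10
    if total > lowest:
        return 0b01
    return 0b00  # assumption: a sum equal to the lowest threshold maps here


level = downstream_congestion_2bit([3, 6, 9], [1, 1, 0], (4, 8, 12))
```

In this illustrative call the third port is masked out, so the sum is 9, which exceeds the middle threshold of 8 but not the highest threshold of 12.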

On the receiving side of the channel, this 2-bit value is mapped to a 4-bit value by indexing into a 4-entry by 4-bit wide downstream congestion remapping table. The estimated far-end link congestion is computed by tracking the number of flits sent longer than the channel round trip latency in the past that have not yet been acknowledged, and adjusting by the relative rates of flit transmission and acknowledgement receipt. The mechanism used to do this is a 5-bit wide 32-entry deep delay chain. For an MMR-configurable number of cycles (1 to 31), the router counts the number of flits transmitted into the tail position of this delay chain. After this delay, all of the values are shifted. The total expected outstanding flits on the channel (those transmitted and for which an ack is still expected) is the sum of the values in this chain. This value is compared to the outstanding credit count. The total number of outstanding credits minus the expected flits on the channel represents an estimate for the number of flits stored in the remote Input Queue.
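A simplified model of the delay chain is sketched below. It is an illustration under stated assumptions rather than the hardware mechanism: the class and method names are hypothetical, the rate adjustment mentioned above is omitted, the chain depth multiplied by the counting period is assumed to approximate the channel round trip, and the estimate is clamped at zero.

```python
from collections import deque


class FarEndEstimator:
    """Illustrative model of the 32-entry, 5-bit delay chain."""

    def __init__(self, depth=32, period=4):
        self.chain = deque([0] * depth, maxlen=depth)
        self.period = period  # MMR-configurable cycles per entry (1 to 31)
        self.cycle = 0

    def tick(self, flits_sent):
        # Count flits into the tail entry, saturating at the 5-bit maximum
        self.chain[-1] = min(31, self.chain[-1] + flits_sent)
        self.cycle += 1
        if self.cycle == self.period:
            # Shift the chain: the oldest entry drops off the head
            self.chain.append(0)
            self.cycle = 0

    def in_flight(self):
        """Flits expected to still be on the channel or awaiting an ack."""
        return sum(self.chain)

    def remote_queue_estimate(self, outstanding_credits):
        """Outstanding credits not explained by flits still in flight."""
        return max(0, outstanding_credits - self.in_flight())


est = FarEndEstimator(depth=4, period=2)
est.tick(1)
est.tick(1)
queued = est.remote_queue_estimate(outstanding_credits=5)
```

Here two flits were sent recently enough to still be in flight, so of the five outstanding credits, three are attributed to flits waiting in the remote input queue.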

The estimated far-end congestion is calculated as a 10-bit number. This number is converted to a 4-bit index according to a mapping table, and this 4-bit number is then remapped to another programmable 4-bit value by indexing into a 16-entry far-end congestion remapping table.

The near-end link congestion is computed by summing the flits queued in the column buffer waiting to be transmitted across the link. This sum is also a 10-bit value and is converted to a 4-bit value according to a mapping table. This 4-bit number is then remapped to another programmable 4-bit value by indexing into a 16-entry near-end congestion remapping table.

The remapped 4-bit downstream port congestion value, the remapped 4-bit far-end link congestion value, and the remapped 4-bit near-end link congestion value are combined to produce a single 4-bit congestion value per tile. This combination is done as a 3-input 4-bit unsigned saturating addition. This 4-bit congestion value is propagated to all other tiles on the chip to aid those tiles in making informed adaptive choices.
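The 3-input saturating addition can be expressed compactly. The function name below is hypothetical, and the inputs are assumed to be the already remapped 4-bit values.

```python
def tile_congestion(downstream, far_end, near_end):
    """3-input, 4-bit unsigned saturating add of remapped congestion values."""
    for value in (downstream, far_end, near_end):
        if not 0 <= value <= 15:
            raise ValueError("each input must be a 4-bit value (0 to 15)")
    # Saturate at the largest 4-bit value instead of wrapping around
    return min(downstream + far_end + near_end, 15)


combined = tile_congestion(3, 4, 5)
saturated = tile_congestion(9, 9, 9)
```

Saturation keeps a heavily congested tile pinned at the maximum value rather than wrapping to a misleadingly small number.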

A “link alive” signal is broadcast from each ntile on the chip to all other tiles on the chip. This link alive signal for each ntile indicates whether the corresponding tile has an established serial link with the router it is connected to. Ports for which the link is not alive will be considered invalid from a port selection perspective. This allows the router to adaptively avoid recently failed links which software has not yet been able to remove from the routing tables.

The link alive signals are propagated around the router via a 2-wire serial chain that connects all of the network tiles. Each tile places its link status information on the serial chain at the appropriate bit timing. If all of the ports presented to the congestion logic are invalid, the packet will be discarded. In this case, it will be up to end-point hardware to timeout on the missing packet and up to higher-level software to retransmit or handle the error as appropriate.

At each Input Queue, the broadcast congestion values are used in making the adaptive choice between the two minimal and two non-minimal port candidates. Before using these congestion values, bias values are applied to the selected two minimal and two non-minimal port congestion values. First, the values are logically extended to a 6-bit value by prepending two zeros to the most significant part of the value. The adaptive routing control type (adaptive0, adaptive1, adaptive2, or adaptive3) is used to select a set of biases from a four-entry bias table. Each entry has a pair of 2-bit shift values that determine how far left to shift the minimal and non-minimal port congestion values, respectively. The 6-bit expanded congestion value can be shifted by zero, one, or two bits. The encoding of this field is 2′b00=shift left by zero bits (multiply by one), 2′b01=shift left by one bit (multiply by two), 2′b10=shift left by two bits (multiply by four), 2′b11=reserved.

Each bias MMR also contains a pair of 6-bit values that is added to the 6-bit expanded minimal and non-minimal congestion values. The addition is performed as a saturating add, resulting in a 6-bit number. The port corresponding to the lowest congestion is picked. If there is a tie between a minimal and a non-minimal port, the router favors the minimal port. If there is a tie between the two ports presented as non-minimal or between the two ports presented as minimal, the choice is arbitrary and may be made in any suitable way.
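The bias and selection steps can be sketched as follows. The function names and the dictionary layout of a bias entry are hypothetical. Where the text leaves a tie between two ports of the same class arbitrary, this sketch simply picks the first one listed.

```python
def apply_bias(congestion4, shift_code, add6):
    """Expand a 4-bit congestion value to 6 bits, shift it, then add a bias."""
    if shift_code == 0b11:
        raise ValueError("shift encoding 2'b11 is reserved")
    shifted = congestion4 << shift_code  # at most 15 << 2 = 60, fits in 6 bits
    return min(shifted + add6, 63)       # 6-bit saturating add


def select_port(minimal, non_minimal, bias):
    """Pick the least congested port, favoring minimal ports on a tie.

    minimal, non_minimal : lists of (port, 4-bit congestion) candidates
    bias                 : one bias table entry (hypothetical dict layout)
    """
    scored = []
    for port, cong in minimal:
        score = apply_bias(cong, bias["min_shift"], bias["min_add"])
        scored.append((score, 0, port))  # class 0 wins ties
    for port, cong in non_minimal:
        score = apply_bias(cong, bias["nonmin_shift"], bias["nonmin_add"])
        scored.append((score, 1, port))
    return min(scored, key=lambda s: (s[0], s[1]))[2]


no_bias = {"min_shift": 0, "min_add": 0, "nonmin_shift": 0, "nonmin_add": 0}
chosen = select_port([(1, 5), (2, 7)], [(3, 5), (4, 9)], no_bias)
```

With no bias applied, ports 1 and 3 tie at a congestion of 5, and the minimal port 1 is chosen. Doubling the minimal congestion with a shift code of 2′b01 would instead steer the packet to non-minimal port 3.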

Table-Driven Routing Mechanism in a Dragonfly Network

The routing example presented here uses a variety of tables to determine paths available in routing a packet or message, and provide routing flexibility in the dragonfly network configuration. Different tables exist to provide routing within a group and between groups, and for minimal and non-minimal routing paths.

The routing structures in the example router architecture presented here are divided into four distinct table sets: a global non-minimal (GN) table set, a global minimal (GM) table, a local non-minimal (LN) table set, and a local minimal (LM) table. The logical flow of this specific example is shown in FIG. 12.

The global tables are used to determine how to route to a remote group, when the current group is not the target group. They are used to route toward a particular optical port on which to exit the local group. Local tables are used to route to a particular router chip within the current group. They are used for “up” or “down” routing within the group for local routing or for “up” routing in the intermediate group. Minimal tables specify minimal local or global routes. They are used when routing down or, in the case of adaptive routing, when attempting to take a minimal path on the way up. Non-minimal tables specify non-minimal paths, and are only used when routing “up”. They also provide a “root-detect” mechanism for determining when to stop routing up.

The global non-minimal table set is used to route non-minimal traffic to an intermediate group. It contains a list of ports that lead to “safe” intermediate groups, where a “safe” intermediate group is one that is connected to all other groups. (In a healthy network, all groups are safe. In a partially healthy network, the tables should be programmed to avoid sending traffic to an intermediate group that may not connect to the target group.) This table set consists of three tables. The first table selects which rank in the green dimension to traverse to leave the current (source) group. The second table selects the black dimension to traverse. The third table selects the optical port to leave the current router chip on.

The tables are arranged hierarchically in a fixed priority order. The green dimension table has the highest priority, and the blue dimension table has the lowest. Each table lists a set of port numbers to leave the Aries on, or a special value that indicates that the current table is deferring its priority and the next table in the priority hierarchy should be consulted. A special value on the lowest priority (blue) table, if referenced, will result in an error condition. Each table consists of 128 entries, each of which is a 6-bit port number or the special value of 6′b11xxxx. Each table is organized as 16 by 8 entries, with an accompanying 7-bit ECC for each block of 8 entries.
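The priority hierarchy can be sketched as a lookup that falls through to the next table whenever the special value is encountered. The function names and the use of Python lists for the tables are illustrative assumptions. The sketch also assumes the black table sits between the green and blue tables in priority.

```python
SPECIAL_MASK = 0b110000  # 6'b11xxxx: both upper bits set


def is_special(entry):
    """True when a 6-bit entry defers to the next table in the hierarchy."""
    return (entry & SPECIAL_MASK) == SPECIAL_MASK


def lookup_port(green, black, blue, indices):
    """Consult green, then black, then blue, returning the first real port.

    indices : one index per table, since each table is indexed separately
    """
    for table, index in zip((green, black, blue), indices):
        entry = table[index]
        if not is_special(entry):
            return entry
    raise LookupError("special value in the lowest priority (blue) table")


# Green defers, so the black table supplies the exit port.
port = lookup_port([0b110000], [5], [40], (0, 0, 0))
```

Because the router in this example has 48 tiles, port numbers 0 through 47 never collide with the 6′b11xxxx encoding, which covers the values 48 through 63.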

This table should only contain routes to other router chips or optical port numbers that ultimately lead to an intermediate group that can safely route to all other groups in the system. The table also provides the mechanism that distributes non-minimal traffic roughly evenly over the groups in the system. There are 128 entries in each table so that even with an effective radix-18 dimension, each port is listed 7 or 8 times, leading to at most a 14.3% imbalance between two ports in the dimension. This imbalance can be minimized by having the imbalanced ports differ on the multiple copies of the table throughout the group.
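The 14.3% figure follows from dividing 128 entries among 18 ports, as the short calculation below illustrates. The function name is hypothetical.

```python
def listing_imbalance(entries, ports):
    """Return (fewest listings, most listings, relative imbalance)."""
    fewest, leftover = divmod(entries, ports)
    most = fewest + 1 if leftover else fewest
    return fewest, most, (most - fewest) / fewest


fewest, most, imbalance = listing_imbalance(128, 18)
```

For 128 entries and 18 ports, 16 ports are listed 7 times and 2 ports are listed 8 times, so the most favored port sees 8/7 of the traffic of the least favored, an excess of about 14.3%.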

For global deterministic routing, this table set is indexed into by a hash value including the target, the tgtID, (possibly the local port number), and the optional hash field from the packet header (which comes from the packet address). Each table will get a different index. For global adaptive routing, one of the blocks of 8 entries is selected from the table at random. A second entry is selected at random from that 8-entry block. The two ports are compared with each other and with two entries from the global minimal table to determine which path to route the packet.

The green tables in the ptiles will generally have each of the 15 green ports listed 8 times and will have 8 special values. Further, at the ptiles, the black tables will have each of the 15 black ports listed approximately 7 times, with approximately 21 entries containing special values. The blue tables will have each of the optical ports listed about 13 times each.

The green ntile ports will generally have all of the entries in the green table as the special value. The black and blue tables will be configured in the same proportions as in the ptile case. The black ntile ports will generally have all of the entries in the green and black tables as the special value. The blue tables will be configured in the same proportions as the ptiles.

The global minimal table is used to determine a direct path from the current group to the target group. It consists of 256 entries, each of which is 81 bits wide. Each entry is divided into two parts, a full port set and a restricted port set. The full port set consists of 8 6-bit port entries and a 3-bit modulo specifier. The modulo field indicates the total number of valid ports in the associated entry. The modulo specifier is encoded as the modulo minus one. That is, a value of 7 in the modulo field will result in a modulo of 8 operation. The restricted port set consists of 4 6-bit ports and a 2-bit modulo specifier. Each 81-bit entry will also have an 8-bit ECC.

This table is organized by target group numbers. Each target group corresponds to a “block” of 1, 2, 4, 8, 16, 32, 64, or 128 entries in the table, according to the size of the system. A system with 241 groups would have 1 entry per block in the table. (15 of the entries would be unused.) A system with 65-128 groups would use 2 entries per block. A system with 33-64 groups would use a block of four entries, and so forth. The group number along with zero to seven additional random (adaptive routing) or hash (deterministic routing) bits are used to index into the table. Each entry contains a list of ports leading to Aries reachable from the current point in routing that connect minimally to the associated target group, or leading directly to the target group over a blue link.
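The relationship between system size, block size, and table index can be sketched as follows. The function names are hypothetical, and the index layout, with the group number in the upper bits and the random or hash bits in the lower bits, is an assumption, since the text does not specify the bit arrangement.

```python
TABLE_ENTRIES = 256
MAX_BLOCK = 128


def block_size(num_groups):
    """Largest power-of-two block such that every group fits in the table."""
    size = 1
    while size < MAX_BLOCK and size * 2 * num_groups <= TABLE_ENTRIES:
        size *= 2
    return size


def table_index(group, selector_bits, num_groups):
    """Combine a group number with random (adaptive) or hash bits."""
    size = block_size(num_groups)
    # Assumed layout: group selects the block, low bits select within it
    return group * size + (selector_bits & (size - 1))


sizes = [block_size(n) for n in (241, 128, 65, 64, 33)]
index = table_index(group=5, selector_bits=3, num_groups=64)
```

For the system sizes named in the text, this yields block sizes of 1, 2, 2, 4, and 4 entries respectively.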

The full port set is used when just beginning to route minimally within a group (either at a ptile or an optical ntile) toward another group, or at any tile when routing non-minimally within the intermediate group and the root is detected in the local non-minimal table (see below). This side of the table lists all possible paths to all possible optical ports that are connected minimally to the group specified by the index. The restricted port set is used for routing within the group other than in the root detect and injection cases mentioned for the full port set. This half of the table only represents paths in the network that are legal from the current point in the group network, assuming we are routing minimally.

The key purpose of the restricted port list is to prevent packets from flowing back in the direction from whence they came. At a green port, the restricted table entries should normally only list black and blue ports. At a black port, the restricted table entries should normally only list blue ports.

When all of the ports listed in the restricted set are invalid, this indicates to the adaptive routing logic that a packet has diverged from a legal minimal path. In this case the adaptive routing logic will pick one of the non-minimal choices. (This should never occur for deterministically or minimally routed traffic, as the tables should be written in a consistent manner such that a packet never arrives at a point where it cannot route to the destination. If this does occur, the router will flag an error and discard the packet.)

When there are no legal restricted routes in a tile, the mod value can be set to any value. The route table should contain the special value of 6′b11xxxx in all of the entries associated with the group number. When there is only one legal route, the port list should contain the legal route listed at least twice and the mod value set to two or higher to match.

For deterministic routing, one of the valid entries in either the full or restricted set is selected by computing a modulo of a hash by the number of valid entries in the associated index. As in the cases above, adaptive routing will choose 2 entries from the table by computing the mod on a random number, and then computing a second modulo of N−1 that is added to the first number plus one to get the offset of a second random but unique entry in the table.
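The two-entry selection can be sketched as follows. The function name is hypothetical, and wrapping the second offset back into the range of valid entries with a final modulo of N is an interpretation of the text rather than a stated detail.

```python
def pick_two(n_valid, r1, r2):
    """Choose two distinct offsets among n_valid entries (n_valid >= 2).

    r1, r2 : random numbers for adaptive routing
    """
    if n_valid < 2:
        raise ValueError("at least two valid entries are required")
    first = r1 % n_valid
    # Step forward by 1 to N-1 positions so the second entry never
    # coincides with the first (wrap-around is an assumption)
    second = (first + 1 + r2 % (n_valid - 1)) % n_valid
    return first, second


pair = pick_two(8, r1=10, r2=13)
```

Since the step from the first offset is always between 1 and N−1, the two offsets are guaranteed to differ, which is consistent with the requirement that a single legal route be listed at least twice.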

The local non-minimal table set is used to pick a router chip in the local group that is used as the root for non-minimal routing within the group. This table is used for non-minimal routing when the source and target group are the same. It is also used for non-minimal routing in the intermediate group. This table set is structured like the global non-minimal table, except that there is no blue table.

The local non-minimal table is indexed randomly for adaptive routing or by a hash for non-minimal deterministic routing. Similar to the global non-minimal table, for adaptive routing two entries are produced by this table and compared. To reduce the number of total RAM macros in the design, these tables will be physically combined with the global non-minimal tables in RAM.

This table lists Aries that are reachable from this tile that are safe to use for local non-minimal routing. In a healthy network, the ptiles and blue (optical) tiles should list all Aries in the group roughly evenly. Approximately 15/16 of the entries in the green table should list green ports, and ~1/16 should contain the special value indicating that the green dimension has already been satisfied and that the black table should be used. Similarly, ~5/6 of the entries in the black table should list black ports, and ~1/6 should contain the special value indicating that the black dimension has been satisfied. A special value in both the green and black tables indicates that the root has been reached (“root detect”) and that the packet should be downrouted from this point.

The green tiles should fill the green table with special values (indicating that the green dimension has been satisfied), and should list the 6 Aries reachable (including self, using the special value) in the black table evenly. The black tiles should fill both the green and black tables with the special root detect value. The ptiles and optical tiles need the full table set. Ntiles could technically do without the green table; however, the router table example presented here implements them for flexibility.

The local minimal table is used for minimal routing (“downrouting”) within the target group, and also when adaptively “uprouting” in the target group. This table has 128 entries. Each entry is 52 bits wide, consisting of 8 6-bit port numbers, a “diverged” bit, and a mod value indicating how many entries are valid in this line of the table. The diverged bit indicates that the path within the target group has diverged from a minimal path, and thus this path cannot be used as a minimal path when adaptively uprouting, and can only be used for downrouting. It is similar to the case in the global minimal table where all the ports in the restricted set are invalid.
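The 52-bit entry layout described above can be illustrated with a small packing sketch. Several details here are assumptions rather than statements from the text: the 3-bit width of the mod field is inferred from 52 − 8×6 − 1, the ordering of the fields within the word is arbitrary, and the function names are hypothetical. How a count of eight valid ports would be encoded in the mod field is not stated, so the sketch simply stores a raw 3-bit value.

```python
PORT_BITS = 6          # each port number is 6 bits wide
PORTS_PER_ENTRY = 8    # eight port numbers per entry
DIVERGED_SHIFT = PORT_BITS * PORTS_PER_ENTRY   # bit 48 (assumed position)
MOD_SHIFT = DIVERGED_SHIFT + 1                 # bits 49..51 (assumed position)
MOD_BITS = 3           # assumption: 52 - 48 - 1 bits remain for the mod value


def pack_entry(ports, diverged, mod):
    """Pack eight 6-bit ports, a diverged bit and a mod value into one word."""
    if len(ports) != PORTS_PER_ENTRY:
        raise ValueError("an entry holds exactly eight port numbers")
    if not 0 <= mod < (1 << MOD_BITS):
        raise ValueError("mod value does not fit in the assumed 3-bit field")
    word = 0
    for i, port in enumerate(ports):
        if not 0 <= port < (1 << PORT_BITS):
            raise ValueError("port number does not fit in 6 bits")
        word |= port << (i * PORT_BITS)
    word |= (1 if diverged else 0) << DIVERGED_SHIFT
    word |= mod << MOD_SHIFT
    return word


def unpack_entry(word):
    """Inverse of pack_entry: return (ports, diverged, mod)."""
    mask = (1 << PORT_BITS) - 1
    ports = [(word >> (i * PORT_BITS)) & mask for i in range(PORTS_PER_ENTRY)]
    diverged = bool((word >> DIVERGED_SHIFT) & 1)
    mod = (word >> MOD_SHIFT) & ((1 << MOD_BITS) - 1)
    return ports, diverged, mod
```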

This table is organized by the “target” Aries number within the group. Each local Aries number corresponds to a block of 1, 2, 4, 8, or 16 entries in the table, according to the size of the group. A group of 65-128 Aries would use a block size of 1 entry per local Aries number. A group size of 33-64 Aries would use a block size of 2, and so forth. The local Aries number along with zero to four additional random (adaptive routing) or hash (deterministic routing) bits are used to index into the table. Each entry contains a list of ports leading to the associated local Aries.
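The block-size rule and the table index can be sketched as below. The text gives only the first two group-size ranges and says "and so forth", so the remaining ranges (17-32, 9-16, and 8 or fewer) are an extrapolation. Placing the local Aries number in the upper bits of the index and the random or hash bits in the lower bits is likewise an assumption, and the function names are hypothetical.

```python
TABLE_ENTRIES = 128   # size of the local minimal table
MAX_BLOCK = 16        # largest block size named in the text


def block_size(group_size):
    """Entries per local Aries number: 1, 2, 4, 8 or 16 depending on group size."""
    if not 1 <= group_size <= TABLE_ENTRIES:
        raise ValueError("group size must be between 1 and 128")
    size = 1
    # Keep doubling while every Aries in the group still gets a full block.
    while size < MAX_BLOCK and group_size * size * 2 <= TABLE_ENTRIES:
        size *= 2
    return size


def table_index(local_aries, group_size, selector_bits):
    """Index of the entry for a target Aries.

    selector_bits supplies the random (adaptive) or hash (deterministic)
    bits; only the low log2(block size) bits of it are used.
    """
    if not 0 <= local_aries < group_size:
        raise ValueError("local Aries number outside the group")
    size = block_size(group_size)
    return local_aries * size + (selector_bits % size)
```

For instance, under these assumptions a 96-Aries group uses one entry per Aries, so the selector bits are ignored, while a 16-Aries group gives each Aries a block of 8 entries selected by three random or hash bits.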

For deterministic routing, one of the valid entries in the table is selected by computing a modulo of a hash by the number of valid entries in the associated index. As in the cases above, adaptive routing will choose 2 entries from the table by computing the mod on a random number, and then a second modulo of N−1 that is added to the first number plus one to get the offset of a second random but unique entry in the table.

The global non-minimal tables are only used in the source group for traffic headed to another group. The global non-minimal and local non-minimal tables are never used concurrently. Therefore, to reduce the total number of RAMs needed, the global non-minimal green table is stored in the same RAM as the local non-minimal green table. The global non-minimal black table is stored in the same RAM as the local non-minimal black table. The global table is stored in the lower index value portion of each of those two RAMs.
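The sharing of one RAM between a global and a local non-minimal table might be modeled as a simple address mapping, with the global table occupying the lower indices as stated above. The table sizes are not given in this passage, so they are passed in as hypothetical parameters, and the function name is likewise an assumption.

```python
def shared_ram_address(table, index, global_entries, local_entries):
    """Map a (table, index) pair to an address in the shared RAM.

    The global non-minimal table sits in the lower index portion; the
    local non-minimal table follows it. Sizes are hypothetical inputs.
    """
    if table == "global":
        if not 0 <= index < global_entries:
            raise ValueError("index outside the global table")
        return index
    if table == "local":
        if not 0 <= index < local_entries:
            raise ValueError("index outside the local table")
        return global_entries + index
    raise ValueError("table must be 'global' or 'local'")
```

Because the two tables are never used concurrently, a single read port per RAM suffices in this model: each lookup supplies exactly one (table, index) pair.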

CONCLUSION

The above examples illustrate how routing in a Dragonfly network can be improved by using adaptive routing that is able to select a network path based on factors such as network congestion or traffic type, and routing tables for various routings including minimal and non-minimal, and local and global routing.

Adaptive routing provides deadlock-safe routing that chooses among multiple legal routes based on congestion or down links, providing improved routing performance and tolerance by explicitly communicating congestion across channels. Routing is performed across multiple minimal routes, such as routing in different dimensions first, and optionally further selected from one or more non-minimal routes, such as using randomly chosen hops to avoid congestion or downed links.

Congestion information is based on anticipated next link congestion from elements such as counting the number of messages in an output queue and establishing a receiving buffer congestion estimate through factors such as credits or messages in-flight. A node can query a potential receiving node for the average “next link” output congestion, enabling the node to make a routing decision based on avoiding congested or down links. Other features, such as using a deterministic hash or a random number to spread traffic in choosing a routing path are also provided, and are useful in spreading traffic to prevent congestion.

Routing choices are presented via tables in one example, and may be biased toward certain routes or toward minimal or non-minimal routes depending on the network configuration and state. For example, route choice may be biased toward minimal routing by default for highest efficiency, with a bias switch toward non-minimal routing to protect a certain network link from arbitrarily or unnecessarily receiving additional traffic. In a further example, the routing tables include local and global routing tables, and tables for minimal and non-minimal paths.
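One way such a bias could be applied is sketched below. This is purely illustrative: the passage above does not say how the bias enters the comparison, so the additive bias term, the load values, and the function name are all assumptions.

```python
def choose_route(minimal_load, non_minimal_load, bias=0):
    """Pick a route class by comparing congestion estimates.

    A positive bias favours the minimal route, a negative bias favours
    the non-minimal route. The additive form is an assumption.
    """
    if minimal_load <= non_minimal_load + bias:
        return "minimal"
    return "non-minimal"
```

With a positive bias, the minimal route is kept even when it is somewhat more congested, matching the default preference for minimal routing; switching the bias negative steers traffic away from the minimal link.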

Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is intended to cover any adaptations or variations of the example embodiments of the invention described herein. It is intended that this invention be limited only by the claims, and the full scope of equivalents thereof.

What is claimed is:
 1. An apparatus comprising: at least one first interface to couple to one or more processors in a plurality of processors; at least one second interface to couple to a plurality of routers in a particular group of routers, wherein the particular group is one of a plurality of groups in a dragonfly topology, each of the groups in the plurality of groups comprises a respective plurality of routers, and each router is to be connected to every other router in its group; at least one third interface to support an optical channel to couple the particular group to one other group in the plurality of groups, wherein each group in the plurality of groups is connected to each other group in the plurality of groups by an optical channel; and routing logic to use at least one of a plurality of routing tables for use in routing data between processors in the plurality of processors.
 2. The apparatus of claim 1, wherein the first interface comprises one or more ports, the second interface comprises one or more ports, and the third interface comprises one or more optical ports.
 3. The apparatus of claim 1, wherein data is to be routed over at most two local channel hops and at most one global optical channel hops.
 4. The apparatus of claim 1, wherein at least one of the routing tables facilitates minimal routing.
 5. The apparatus of claim 1, wherein at least one of the routing tables facilitates non-minimal routing.
 6. The apparatus of claim 1, wherein at least one of the routing tables facilitates adaptive routing.
 7. The apparatus of claim 6, wherein adaptive routing balances load across global channels that interconnect groups in the plurality of groups.
 8. The apparatus of claim 1, wherein at least one of the routing tables is used for routing between groups and at least one of the routing tables is used for routing between router modules within a particular one of the plurality of groups.
 9. A system comprising: a plurality of processors nodes; and a plurality of router modules, wherein each of the plurality of router modules are coupled to one or more of the plurality of processor nodes, the plurality of router modules comprise a dragonfly topology network comprising a plurality of groups, each group in the plurality of groups comprises a respective subset of the plurality of router modules, the router modules of each group are interconnected to each other router module in the group by a respective local channel, each group in the plurality of groups is interconnected with each other group in the plurality of groups by a respective optical channel, and one or more routing tables are to be associated with each of the plurality of groups.
 10. The system of claim 9, further comprising routing logic to use the routing tables to route data between two of the plurality of router modules.
 11. The system of claim 9, wherein the plurality of router modules form one or more high radix routers.
 12. The system of claim 9, wherein data is to be routed over at most two local channel hops and at most one global optical channel hops.
 13. The system of claim 9, wherein at least one of the routing tables facilitates minimal routing.
 14. The system of claim 9, wherein at least one of the routing tables facilitates non-minimal routing.
 15. The system of claim 9, wherein at least one of the routing tables facilitates adaptive routing.
 16. The system of claim 9, wherein adaptive routing balances load across global channels that interconnect groups in the plurality of groups.
 17. The system of claim 9, wherein at least one of the plurality of groups corresponds to a server chassis.
 18. The system of claim 17, further comprising a plurality of server chassis, wherein global optical channels interconnect the server chassis with at least one other server chassis.
 19. The system of claim 9, wherein at least one of the routing tables is used for routing between groups and at least one of the routing tables is used for routing between router modules within a particular one of the plurality of groups.
 20. An apparatus comprising: a plurality of processors; a plurality of router modules to interconnect in a dragonfly topology, wherein the dragonfly topology is to comprise a plurality of groups, each router module comprises: at least one first interface to couple the router module to one or more of the plurality of processors; at least one second interface to couple the router module to every other router module in its respective group of router modules; and at least one third interface to couple the group of router modules of the router module to another one of the plurality of groups of router modules over a global optical channel, and routing logic to use one or more of a plurality of routing tables to route data.
 21. The apparatus of claim 20, wherein data is to be routed over at most two local channel hops and at most one global optical channel hops.
 22. The apparatus of claim 20, wherein the first interface comprises one or more ports, the second interface comprises one or more ports, and the third interface comprises one or more optical ports.
 23. The apparatus of claim 20, wherein at least one of the routing tables facilitates minimal routing.
 24. The apparatus of claim 20, wherein at least one of the routing tables facilitates non-minimal routing.
 25. The apparatus of claim 20, wherein at least one of the routing tables facilitates adaptive routing.
 26. The apparatus of claim 20, wherein at least one of the routing tables is used for routing between groups and at least one of the routing tables is used for routing between router modules within a particular one of the plurality of groups.